Method and apparatus for hybrid compression processing for high levels of compression

ABSTRACT

In one embodiment, an apparatus comprises a first compression engine to receive a first compressed data block from a second compression engine that is to generate the first compressed data block by compressing a first plurality of repeated instances of data that each have a length greater than or equal to a first length. The first compression engine is further to compress a second plurality of repeated instances of data of the first compressed data block that each have a length greater than or equal to a second length, the second length being shorter than the first length, wherein each compressed repeated instance of the first and second pluralities of repeated instances comprises a location and length of a data instance that is repeated. The apparatus further comprises a memory buffer to store the compressed first and second pluralities of repeated instances of data.

RELATED APPLICATION

This Application is a continuation (and claims the benefit of priority under 35 U.S.C. § 120) of U.S. application Ser. No. 15/277,119, filed Sep. 27, 2016, issued as U.S. Pat. No. 9,825,648 and entitled METHOD AND APPARATUS FOR HYBRID COMPRESSION PROCESSING FOR HIGH LEVELS OF COMPRESSION. The disclosure of the prior Application is incorporated by reference in the disclosure of this Application.

FIELD

The present disclosure relates in general to the field of computer development, and more specifically, to data compression.

BACKGROUND

A computing system may include one or more processors, one or more memory devices, and one or more communication controllers, among other components. Logic of the computing device may be operable to access and compress a data set.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a block diagram of components of a computer system in accordance with certain embodiments.

FIG. 2 illustrates a flow for compressing a data set in accordance with certain embodiments.

FIG. 3 illustrates a flow for generating a first compressed data set in accordance with certain embodiments.

FIG. 4 illustrates a flow for further compressing a first compressed data set to generate a second compressed data set in accordance with certain embodiments.

FIG. 5 illustrates a first exemplary compressed data set in accordance with certain embodiments.

FIG. 6 illustrates a second exemplary compressed data set in accordance with certain embodiments.

FIG. 7 illustrates an example block diagram of a field programmable gate array (FPGA) in accordance with certain embodiments.

FIG. 8A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline in accordance with certain embodiments.

FIG. 8B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor in accordance with certain embodiments;

FIGS. 9A-B illustrate a block diagram of a more specific exemplary in-order core architecture, which core would be one of several logic blocks (potentially including other cores of the same type and/or different types) in a chip in accordance with certain embodiments;

FIG. 10 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics in accordance with certain embodiments; and

FIGS. 11-14 are block diagrams of exemplary computer architectures in accordance with certain embodiments.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

Although the drawings depict particular computer systems, the concepts of various embodiments are applicable to any suitable integrated circuits and other logic devices. Examples of devices in which teachings of the present disclosure may be used include desktop computer systems, server computer systems, storage systems, handheld devices, tablets, other thin notebooks, systems on a chip (SOC) devices, and embedded applications. Some examples of handheld devices include cellular phones, digital cameras, media players, personal digital assistants (PDAs), and handheld PCs. Embedded applications may include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPC), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can perform the functions and operations taught below. Various embodiments of the present disclosure may be used in any suitable computing environment, such as a personal computing device, a server, a mainframe, a cloud computing service provider infrastructure, a datacenter, a communications service provider infrastructure (e.g., one or more portions of an Evolved Packet Core), or other environment comprising a group of computing devices.

FIG. 1 illustrates a block diagram of components of a computer system 100 in accordance with certain embodiments. System 100 may include (among other components) compression engines 102, input buffer 104, and output buffer 122. During operation, any suitable component of computer system 100 may store a data set comprising a plurality of bytes (or other logical data groupings). In various systems, a data set may be stored in an uncompressed manner. However, such a method of storage may be expensive in terms of memory usage and bandwidth. Various embodiments of the present disclosure may provide techniques to efficiently compress any suitable types of data sets, such as data sets used in database applications, storage applications, networking applications, or other suitable applications.

Various embodiments of the present disclosure may improve the speed and compression ratios of various compression algorithms such as LZ77-based compression algorithms, including DEFLATE, LZO, LZS, LZF, LZ4, SNAPPY, and other compression algorithms. An LZ77-based algorithm identifies repeated data sequences and replaces them with backward references (having relative distance offsets) to previous instances of the same data sequences. The compressed data output by an LZ77-based algorithm includes a series of elements of two types: literals (data sequences for which previous instances were not found) and matches that include a length attribute and a position attribute (that refer back to a previous instance of a matching data sequence).
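As a brief illustration of these two element types, the following sketch (in Python, with hypothetical names; not the claimed apparatus) decompresses such a stream of literal and match elements, where each match carries a refer-back distance and a length:

```python
def lz77_decode(elements):
    """Reconstruct data from a list of (literals, distance, length) elements."""
    out = bytearray()
    for literals, distance, length in elements:
        out.extend(literals)            # copy the uncompressed literal run
        for _ in range(length):         # copy 'length' bytes starting
            out.append(out[-distance])  # 'distance' bytes back (may overlap)
        # a zero-length match encodes "no match" between two literal runs
    return bytes(out)

# "abc" as literals, then a 6-byte overlapping match referring 3 bytes back
assert lz77_decode([(b"abc", 3, 6), (b"d", 0, 0)]) == b"abcabcabcd"
```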

LZ77-based compression algorithms seek to find good data sequence matches at each position of a data set. The compression process typically searches a large number of locations and determines the longest match and/or other suitable match (in some algorithms, in order to save time, a “good enough” match length may be defined). The speed of a compression algorithm (whether implemented in hardware and/or software) is generally limited by the number of comparisons performed at each position of the data set, which may be closely related to the number of data sequences in a history buffer that have the same short prefix as the target data sequence at the current position. The short prefix at the current position may be hashed and the result may be used to identify other data sequences having the same hash value (which are then compared against the short prefix to determine whether a match exists). The length of the short prefix (i.e., the data that is hashed and forms at least a portion of data compared against possible matches) is typically set to the minimum length match of the particular algorithm (i.e., the minimum length for matches that will be encoded by match length and position attributes). As one example, the minimum length match for the DEFLATE compression algorithm is three bytes. Such implementations may result in possible matches in the history buffer not being missed, but short prefix lengths tend to make the resulting hash chains (i.e., the set of locations associated with a particular hash value that store potentially matching data sequences) very long, causing a significant increase in compression processing times. Increasing the length of the short prefix that is hashed will speed up processing (since the hash chains will be shorter), but will result in a severe loss of compression, since all or most matches shorter than the minimum length will not be found (and many matches in typical compressed streams tend to be of minimum size).
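The following sketch illustrates the hash-chain bookkeeping described above; the prefix length and the use of a Python dictionary keyed by the prefix (standing in for a hardware hash table) are illustrative assumptions. Shortening PREFIX_LEN lengthens the chains that must be scanned at each position; lengthening it shortens them, but matches below the prefix length are lost:

```python
PREFIX_LEN = 3   # e.g., the three-byte minimum match length of DEFLATE

def find_longest_match(data, pos, chains):
    """Return (distance, length) of the longest prior match for data[pos:],
    or None. 'chains' maps each PREFIX_LEN-byte short prefix (standing in
    for its hash value) to the list of earlier positions where it occurred,
    i.e., the hash chain scanned at each position."""
    prefix = bytes(data[pos:pos + PREFIX_LEN])
    best = None
    for cand in chains.get(prefix, []):        # the whole chain is scanned
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if best is None or length > best[1]:
            best = (pos - cand, length)
    chains.setdefault(prefix, []).append(pos)  # the chain grows by one entry
    return best
```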

Various embodiments propose a hybrid compression scheme comprising a first compression engine that compresses an input data set by encoding matches having a first minimum length and placing the remaining uncompressed data sequences (i.e., literal sequences) into an intermediate data set with the encoded matches, and a second compression engine that further compresses the intermediate data set by encoding matches having a second minimum length (which is shorter than the first minimum length) found in the uncompressed data sequences of the intermediate data set. In various embodiments, the second compression engine may then encode the resulting data set further (e.g., using Huffman encoding). In various embodiments, the second compression engine does a quick search for matches of the second length, comparing the short prefix against a maximum of one data sequence in the history buffer. In various embodiments, the first compression engine may be a hardware accelerator that accepts offloaded processing tasks from a CPU, while the second compression engine may be a CPU. In a particular embodiment, the first minimum length is four and the second minimum length is three.

Various embodiments of the present disclosure may provide technical advantages, such as reducing the amount of storage used to store a compressed data set, reducing the bandwidth used to transfer compressed data, increasing the speed of data compression, increasing the ratio of data compression, and other technical advantages.

In the embodiment depicted, compression engines 102A and 102B each include a compare buffer 106, hash table control logic 108, hash table 110, history buffer 112, compare logic 114, intermediate buffer 116, and compression control logic 118. In addition, compression engine 102B includes encoding logic 120. In other embodiments, the first and/or second compression engine may include other components or may omit any of the depicted components. The first and second compression engines may include any suitable set of components for performing the functions described herein. Components that appear in both compression engines are described collectively below, but should be understood to interact with other components from their respective compression engine, unless otherwise noted.

Input buffer 104 may include any suitable memory (including any of the types of memories referred to herein or other types of memories) for storing an input data set that is to be compressed. In various embodiments, input buffer 104 may be located within compression engine 102A or outside of compression engine 102A.

Compare buffer 106A may store data from the input buffer 104 and compare buffer 106B may store data from the intermediate buffer 116A of compression engine 102A (or other memory that holds a compressed data set output by compression engine 102A or the original input data set). In a particular embodiment, at various points in time, each compare buffer may store a short prefix of data (which in various embodiments is equal to the minimum match length compressed by the respective compression engine 102) as well as a data sequence immediately following the short prefix. The short prefix stored by compare buffer 106A is hashed and compared (along with the data sequence following the short prefix) against potential matches. In various embodiments, the length of a short prefix used by compression engine 102A is larger than the length of a short prefix used by compression engine 102B. In a particular embodiment, the short prefix of compression engine 102A is one byte larger than the short prefix of compression engine 102B. As one example, compare buffer 106A may store a short prefix that is 4 bytes long and compare buffer 106B may store a short prefix that is 3 bytes long. As another example, compare buffer 106A may store a short prefix that is 5 bytes long and compare buffer 106B may store a short prefix that is 4 bytes long. In other embodiments, the short prefix of compression engine 102A may be more than one byte larger than the short prefix of compression engine 102B. In various embodiments, the size of the short prefixes may be dynamically configurable based on the compression algorithm being performed.

The data stored in a compare buffer 106 may change as various positions in the input data set are checked for matches. In a particular embodiment, the compare buffer data advances by one byte (or other data amount) for each set of operations performed by a compression engine (where a set of operations could include hashing the compare buffer data, checking for one or more matches, and/or updating the respective hash table). Thus, when the compare buffer data is updated, the oldest byte (or other data amount) of the compare buffer data may be removed from the compare buffer and a new byte (or other data amount) from the respective input buffer (e.g., input buffer 104 for compare buffer 106A or intermediate buffer 116A for compare buffer 106B) may be added to the end of the compare buffer. Other methods for advancing through the input data are also contemplated by this disclosure. In various embodiments, a separate compare buffer 106 is not used and the short prefix and/or the data sequence following the short prefix is accessed directly (e.g., by hash table control logic 108 or compare logic 114) from another buffer (e.g., input buffer 104 or intermediate buffer 116A). Thus, when data from the compare buffer 106 is referred to herein, such data may also refer to similar data from any other suitable buffer (e.g., input buffer 104, intermediate buffer 116A, or other suitable buffer) in various embodiments.

Hash table control logic 108 includes logic to perform hashes on the short prefix stored in the compare buffer 106, access entries of hash table 110, add entries to or delete entries from hash table 110, and/or perform other operations associated with hash table 110. Logic 108 may implement any suitable hash function. Typically, the hash value output by the hash function will be shorter than the input to the hash function (i.e., the short prefix). In particular embodiments, this may result in aliasing, where the hash function may output the same hash value for multiple different input values (this is resolved by comparing the short prefix along with subsequent data against one or more data sequences at locations associated with the hash value to determine whether the short prefix and subsequent data match the data sequences).

Hash table 110 may be implemented using any type of memory elements (including those described herein or other memory elements) and using any suitable data structure providing associations between hash values and pointers (where each pointer provides an indication of a location of a data sequence). Each hash value of hash table 110 may be associated with any number of pointers that point to locations of data sequences that produced the same hash value. For example, a hash value may be an index to a portion of a table or other data structure, where the portion includes one or more pointers associated with that hash value. In a particular embodiment, each hash value used to access hash table 110A may be associated with any number of pointers, but each hash value used to access hash table 110B may be associated with a maximum of one pointer (and/or only one pointer is accessed when the hash of the current short prefix matches that particular hash value). In such an embodiment, compression engine 102B may be configured to do a quick compare of its short prefix (and subsequent data) (by comparing it against a maximum of one data sequence indicated by a pointer of hash table 110B) while compression engine 102A may be configured to do a more thorough compare of its short prefix (and subsequent data) to one or more data sequences at various locations indicated by multiple pointers of hash table 110A. In other embodiments, each hash value of hash table 110B may be associated with a maximum of two pointers or three pointers, or other limited number of pointers, and/or the compression engine 102B is configured to do a comparison on a maximum of two or three data sequences associated with the hash value obtained from the current short prefix. Such embodiments may provide particular speed advantages when the first compression engine is implemented in hardware (e.g., via a hardware accelerator) and the second compression engine is implemented in software (e.g., via a processor executing software instructions) or in other embodiments.
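A minimal sketch of such a single-pointer (quick-compare) table follows; the table size and multiplicative hash are illustrative assumptions, and aliasing is tolerated because every candidate is verified by a byte comparison:

```python
TABLE_SIZE = 1 << 15                    # illustrative; real sizes vary

def hash3(data, pos):
    """Hash a 3-byte short prefix to a table index (aliasing is possible)."""
    p = data[pos] | (data[pos + 1] << 8) | (data[pos + 2] << 16)
    return (p * 2654435761) >> 9 & (TABLE_SIZE - 1)   # Knuth-style multiply

def probe_and_replace(table, data, pos):
    """Return the at-most-one candidate position for data[pos:], then claim
    the slot so that the newest occurrence of the prefix wins."""
    h = hash3(data, pos)
    candidate = table[h]                # None, or one prior position
    table[h] = pos                      # a maximum of one pointer per hash
    return candidate

table = [None] * TABLE_SIZE             # one slot per hash value
```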

In various embodiments, the pointers of the hash table 110 may refer to data stored in a history buffer 112. Data that has cycled through the compare buffer 106 may be placed in history buffer 112 (in various embodiments the data could be received from the compare buffer 106 or other suitable buffer such as input buffer 104). Thus, the data stored by history buffer 112 may represent at least a portion of the input data set (which in various embodiments would be the data initially stored in input buffer 104). In various embodiments, the data stored in the history buffer is shorter than the entire data set, thus the history buffer 112 may store the most recent data (i.e., the data that was most recently hashed and/or analyzed for matches). In other embodiments, the pointers of the hash table 110 may refer to any other suitable location of data sequences (e.g., locations in the input buffer 104, intermediate buffer 116A, or other suitable buffer). The pointers may have any suitable format. For example, the pointers may store an absolute address or a relative address (e.g., with respect to a current location of the input data set) of a location.

In various embodiments, the maximum number of pointers associated with a hash value and/or the maximum number of pointers that may be accessed by a compression engine 102 when looking for a match for a particular short prefix is reconfigurable (e.g., based on the level of compression selected by a user). In various embodiments, a compression engine 102 may be configured to stop looking for a match when a “good enough” match has been found (even if all of the pointers associated with the current hash value have not yet been accessed or if the maximum number of pointers allowed to be searched have not yet been accessed).

Compare logic 114 is operable to compare the short prefix (and subsequent data) against one or more data sequences (e.g., the locations of which may be indicated by one or more pointers associated with the hash value of the short prefix in the hash table 110). In a particular embodiment, the data sequences compared against the short prefix (and subsequent data) are obtained from the history buffer 112 (though they may be obtained from any suitable memory location). In addition to comparing the short prefix against one or more data sequences, the compare logic 114 may compare bytes (or other data amounts) immediately following the short prefix against bytes immediately following the data sequences indicated by the pointers of the hash table 110 to determine whether the match is longer than the length of the short prefix. In a particular embodiment, a comparison may involve comparing a data sequence from the input data set that begins with the short prefix (e.g., the data currently stored in compare buffer 106) against a data sequence at a location indicated by a pointer of the hash table 110 (e.g., a similarly sized amount of data from history buffer 112) and determining the length of the match based on this comparison. For example, if the compare buffer stored 256 bytes of data starting with a short prefix of 3 bytes, the short prefix and the next 253 bytes in the compare buffer could be simultaneously compared against 256 bytes of the history buffer 112 (or other suitable buffer) in order to determine how many consecutive bytes (starting with the first byte of the short prefix) match.
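A sketch of this comparison, with the 256-byte window taken from the example above as an assumption (a software byte loop stands in for the simultaneous wide compare that hardware would perform):

```python
def match_length(data, cand, pos, window=256):
    """Count how many consecutive bytes starting at 'cand' and at 'pos'
    agree, up to 'window' bytes; the count starts with the short prefix."""
    n = 0
    limit = min(window, len(data) - pos)
    while n < limit and data[cand + n] == data[pos + n]:
        n += 1
    return n
```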

Intermediate buffer 116A may store data compressed by compression engine 102A and intermediate buffer 116B may store data compressed by compression engine 102B. Intermediate buffers 116A and 116B may be implemented using any type of memory elements (including those described herein or other memory elements). In some embodiments, intermediate buffers 116A or 116B may comprise registers, cache memory, and/or other suitable memory elements. The data stored in intermediate buffer 116 may comprise matches (i.e., repeated data sequences that have been encoded in a compressed format) and literal sequences (i.e., uncompressed data sequences that each comprise one or more literal bytes or other grouping of literal data). The matches and literal sequences may be stored in intermediate buffer 116 using any suitable format. In a particular embodiment, a match comprises a length attribute indicating the length of a repeated data sequence and a location attribute indicating a location of an earlier instance of the repeated data sequence. The location attribute may take any suitable form, such as a refer-back distance that indicates how far back (e.g., in bytes or other data unit) the earlier instance of the data sequence appears in the original data set, or other suitable form (such as an absolute position within the original data set).

In a particular embodiment, a compression engine 102 stores the data of intermediate buffer 116 in a series of records that each includes an encoding of one literal sequence and one match. Each literal sequence and match of a record have their own length attribute that may be zero or greater (although a literal sequence and a match in the same record will not each have a length of zero). A length of zero for a literal sequence signifies that there is no literal data in the literal sequence. For example, if two matches were found adjacent to each other in the input data set, a literal sequence may be encoded with a length of zero to indicate that there is no uncompressed literal data in between the two matches. Similarly, a length of zero for a match may indicate that no match exists between two literal sequences (e.g., due to hardware limitations or other design constraints, there may be a maximum length for literal sequences and thus a zero length match may be encoded in between two encoded literal sequences that represent one longer literal sequence). Each record includes an indication of the literal sequence length, the literal sequence value (when the length of the literal sequence is greater than zero), the match location (e.g., this may be an offset from the current position in the data set), and the match length. In one embodiment, the match length of a match is the actual length of the match (as opposed to a format such as LZ4 in which the encoded match length is a relative match length that takes into account the minimum match length; thus an encoded match length of zero in LZ4 may indicate a match length of four bytes since four bytes is the minimum match length). In other embodiments, a separate record may comprise an encoding of a literal sequence or an encoding of a match instead of an encoding of both a literal sequence and a match.
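A sketch of such a record, with hypothetical field names; the example values correspond to records 510A-510C of FIG. 5, discussed below:

```python
from dataclasses import dataclass

@dataclass
class Record:
    literal: bytes   # uncompressed run; b"" encodes a literal length of zero
    distance: int    # refer-back offset to the earlier instance of the match
    length: int      # actual match length; 0 encodes "no match" in this record

records = [
    Record(b"ABCDEABCF", distance=9, length=5),
    Record(b"K", distance=6, length=4),
    Record(b"GHI", distance=0, length=0),   # trailing literals, no match
]
```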

Compression control logic 118 may provide operations of compression engine 102 that are not performed by other components of compression engine 102. For example, compression control logic 118 may control the advancement of the short prefixes through the input data set, the copying of data into the history buffer 112, the determination (in conjunction with the compare logic 114) of a best match, the encoding of matches and literal sequences, communication with one or more other compression engines 102 (e.g., to instruct another compression engine that data is ready for compression), or other suitable operations.

Encoding logic 120 of compression engine 102B is operable to access the data compressed by compression engine 102B (which may be stored in intermediate buffer 116B), encode the data in order to further compress the data, and store the encoded data in output buffer 122. Encoding logic 120 may encode the data in any suitable manner. As one example, encoding logic 120 may analyze the data set stored in intermediate buffer 116B to determine which data sequences occur most frequently in the data set and may generate a list of codes, where shorter codes are mapped to the most frequently occurring data sequences, and may then encode the data set using the codes. In one embodiment, encoding logic 120 may employ Huffman encoding to encode the data set. In some embodiments, encoding logic 120 may generate multiple code tables (e.g., one for literal values and one for match locations) and encode separate data types of the data set based on the multiple code tables. The encoded data is placed in output buffer 122, which may include any suitable memory (including the types of memory described herein or other types of memory) for storing a compressed data set. In various embodiments, encoding logic 120 and/or output buffer 122 may be located within compression engine 102B or outside of compression engine 102B.
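A minimal sketch of one way to build such a code table; it produces a valid prefix code with shorter codes for more frequent symbols, though real DEFLATE-style encoders use canonical, length-limited codes and separate alphabets for literals and match attributes:

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Map each symbol to a bit string, shorter for more frequent symbols.
    A single-symbol alphabet degenerates to an empty code; real encoders
    special-case that."""
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)          # two least frequent subtrees
        fb, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (fa + fb, tie, merged))
        tie += 1                                # tiebreaker avoids dict compares
    return heap[0][2] if heap else {}

codes = huffman_codes(Counter(b"ABCDEABCFABCDEKABCDGHI"))
```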

Although in various embodiments compression engine 102A may be implemented using any suitable logic, in a particular embodiment compression engine 102A is a hardware accelerator that includes specialized hardware (which may or may not be reconfigurable) implementing the logic of compression engine 102A (e.g., hash table control logic 108A, compare logic 114A, and/or compression control logic 118A). As various examples, compression engine 102A may be implemented by a coprocessor, an FPGA, an ASIC, or other suitable logic. In various embodiments, the logic (e.g., compare buffer 106A, hash table control logic 108A, hash table 110A, history buffer 112A, compare logic 114A, intermediate buffer 116A, and/or compression control logic 118A) of compression engine 102A is dedicated to performing compression operations (i.e., the logic is not used for other types of operations as it would be if compression engine 102A were a general purpose processor).

Although in various embodiments compression engine 102B may be implemented using any suitable logic, in a particular embodiment compression engine 102B is a processor (e.g., a CPU) that executes, using one or more processor cores, software instructions to implement the functionality of the logic of compression engine 102B (e.g., hash table control logic 108B, compare logic 114B, compression control logic 118B, and/or encoding logic 120).

In one embodiment where compression engine 102B is a processor, compression engine 102B may receive an instruction to compress a data set and may instruct compression engine 102A to perform a first compression of the data set. In various embodiments, compression engine 102B may communicate the minimum length matches or other parameters associated with the compression to be performed to compression engine 102A and the compression engine 102A is configured accordingly (in other embodiments, compression engine 102A may already be configured appropriately prior to receiving the instruction to compress the data set). Once compression engine 102A has finished compressing all or a portion of the data set, compression engine 102B may retrieve the output of compression engine 102A, further compress the data set (or portion thereof) and/or encode the compressed data set (according to various embodiments described herein), and provide the result (e.g., to output buffer 122). In other embodiments, a different processor may receive an instruction to compress a data set and may instruct compression engine 102A in a similar manner as well as instruct compression engine 102B to further compress the output of compression engine 102A.

FIG. 2 illustrates a flow 200 for compressing a data set in accordance with certain embodiments. The operations, along with any other operations described herein relating to compression of a data set, may be performed by any suitable logic, such as compression engines 102 or other suitable logic.

At 202, repeated data sequences of a first length or greater in an input data block are identified. For example, compression engine 102A may iterate through data stored in an input buffer as described above looking for data sequences that match earlier data sequences of the data block (e.g., which may be stored in a history buffer 112A after they have been used as part of a short prefix, e.g., after they pass through the compare buffer 106). At 204, a first compressed data block with encoded matches is generated. For example, compression engine 102A may encode the repeated data sequences it finds as matches comprising a match length and a location (such as a refer-back offset from the current position). The first compressed data block may also comprise various literal sequences for which no matches were found. The first compressed data block may be stored in intermediate buffer 116A.

At 206, repeated data sequences of a second length are identified in uncompressed portions of the first compressed data block. For example, compression engine 102B may iterate through the output of the first compression engine 102A (e.g., the data stored in intermediate buffer 116A) as described above looking for data sequences that match earlier data sequences of the input data block (which may be stored in a history buffer 112B after they have been used as part of a short prefix by compression engine 102B). Specifically, the compression engine 102B looks for repeated data sequences of the second length within the uncompressed portions (i.e., the literal sequences) of the output of compression engine 102A.

At 208, a second compressed data block with matches of the first length or greater and matches of the second length or greater is generated. For example, compression engine 102B may encode the identified repeated data sequences of the second length (or greater) as matches comprising a match length and a position (such as a refer-back offset) and place the encoded matches with the encoded matches from the output of the first compression engine into a buffer (e.g., intermediate buffer 116B). The second compressed data block may also include uncompressed portions (i.e., literal sequences) for which no matches of the second length (or greater) were found.

At 210, one or more encoding tables for the second compressed data block are generated. For example, Huffman encoding tables may be generated for the match locations and the literal values of the second compressed data block. At 212, an output data block is generated based on the one or more encoding tables. In various embodiments, the output data block is generated by the second compression engine 102B, though it may be generated by any suitable logic of a computer system.

The flow described in FIG. 2 is merely representative of operations that may occur in particular embodiments. In other embodiments, additional operations may be performed. Various embodiments of the present disclosure contemplate any suitable signaling mechanisms for accomplishing the functions described herein. Some of the operations illustrated in FIG. 2 may be repeated, combined, modified or omitted where appropriate. Additionally, operations may be performed in any suitable order without departing from the scope of particular embodiments.

FIG. 3 illustrates a flow 300 for generating a first compressed data set in accordance with certain embodiments. The operations, along with any other operations described herein relating to compression of a data set, may be performed by any suitable logic, such as compression engine 102A or other suitable logic.

At 302, a compare buffer (e.g., compare buffer 106A) is updated. The first time the compare buffer is updated, an amount of data matching the size of the compare buffer may be copied from an input buffer into the compare buffer. In one embodiment, the size of the short prefix used by the compression engine (and stored in the compare buffer) is four bytes, though any other suitable amount of data may be used. During subsequent updates of the compare buffer, a portion from one end of the compare buffer (e.g., the first byte or other portion of the most recent short prefix) may be dropped from the compare buffer and a portion of the input buffer 104 (e.g., the next byte) may be added to the compare buffer. In various embodiments, the data dropped from the compare buffer may be placed into history buffer 112A. In various embodiments, updating the compare buffer may include copying data from the input buffer to compare buffer 106A. In other embodiments that do not utilize a separate compare buffer, control circuitry may be updated at this point such that updated data from input buffer 104 is provided to logic that operates on the short prefix and subsequent data sequence (e.g., hash table control logic 108A or compare logic 114A).

At 304, the short prefix is hashed to produce a hash value. At 306, it is determined whether one or more valid pointers in the hash table 110A are mapped to the hash value. For example, the hash value may be used as an index into the hash table 110A. If one or more pointers exist at the location identified by the hash value, a determination may be made as to whether any of the pointers are valid at 306. As one example, a pointer may be invalid if it points to data that is no longer in the history buffer (i.e., the data is too far back in the data set). If no valid pointers are found in the hash table, then the flow proceeds to 316. If one or more valid pointers are found at 306, then locations identified by the pointers are accessed at 308 to determine whether data sequences at the locations match at least the short prefix of the compare buffer. In one embodiment, this may comprise comparing the contents of the compare buffer 106 (e.g., the entire compare buffer 106 or a portion thereof) to similarly sized data sequences beginning at the locations identified by the pointers to determine the length of the match (starting with the first byte of the short prefix). In another embodiment, determining whether data sequences at the locations match the data sequence in the compare buffer may comprise comparison of the short prefix to the same length of data at the locations identified in the hash table. If a match is found between the short prefix and a data sequence at a location identified by a pointer of the hash table, then data subsequent to the short prefix in the input buffer is compared to data subsequent to the data sequence at the location identified by the pointer of the hash table to determine the length of the match. In various embodiments, comparisons of various lengths of data may be made simultaneously in order to determine what the match length is.

At 310, if no matches are found (e.g., because dissimilar data from a previous portion of the input data set produced the same hash value), the flow moves to 316, where a portion of the data (e.g., the first byte) of the compare buffer (e.g., the portion that will be dropped from the compare buffer and/or placed into the history buffer the next time the compare buffer is updated) is encoded within a literal sequence (or at least set aside for encoding within a literal sequence, since the encoding of a literal sequence may utilize the length of the literal sequence, which is generally not known until the next match is found).

If one or more matches are found at 310, the best match is encoded at 312. The best match may comprise the longest match found after performing compare operations for data sequences at each valid pointer in the hash table, the longest match found after performing compare operations for a maximum number of data sequences at locations identified by valid pointers of the hash table (e.g., the maximum number may be specified by the particular compression level of the algorithm being performed), a match having a length equal to a specified length (e.g., when the comparison operations stop once a “good enough” match is found), or other suitable match depending on the algorithm used. The match may be encoded in any suitable format. In a particular embodiment, the match is encoded by a length attribute specifying the length of the match and a location attribute (e.g., a refer-back offset) specifying the location of an earlier instance of the matching data sequence.

At 314, the hash table is updated. For example, a pointer to the current location is added to the hash table at the index of the hash value (such that future short prefixes that resolve to hash values matching the hash value may be compared against the current short prefix). In some instances, the pointer may be the first pointer associated with the hash value in the hash table or the pointer may be added to the set of one or more pointers already associated with the hash value (or the pointer may replace the oldest pointer currently associated with the hash value).

At 318, it is determined whether the input buffer 104 includes additional data to be analyzed. If it does not, the flow is finished. If it does, the flow moves back to 302, where the compare buffer is updated. If a match was found in the previous iteration, then compare operations (e.g., 306, 308, and 310) and data encoding operations (e.g., 312 and 316) may be skipped until the compare buffer no longer includes data that was part of the match. However, the hash operations (e.g., 304 and 314) may continue with respect to data that was part of the match so that subsequent data sequences may be compared against the various portions of the match.
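Pulling flow 300 together, the following software sketch emulates the first pass under stated assumptions: a four-byte short prefix, unbounded hash chains, records emitted as (literal bytes, distance, length) tuples, and most-recent-first chain traversal so that equal-length matches favor the closest instance. It models the flow's logic, not the accelerator's actual microarchitecture:

```python
def first_pass(data, min_match=4):
    chains = {}            # hash table: 4-byte prefix -> list of prior positions
    records, literals = [], bytearray()
    pos = 0
    while pos + min_match <= len(data):
        prefix = bytes(data[pos:pos + min_match])          # 304: hash the prefix
        best = None
        for cand in reversed(chains.get(prefix, [])):      # 306/308: walk the
            n = 0                                          # chain, newest first
            while pos + n < len(data) and data[cand + n] == data[pos + n]:
                n += 1
            if best is None or n > best[1]:                # ties keep the closest
                best = (pos - cand, n)
        chains.setdefault(prefix, []).append(pos)          # 314: update table
        if best is None:
            literals.append(data[pos])                     # 316: literal byte
            pos += 1
        else:
            dist, length = best
            records.append((bytes(literals), dist, length))  # 312: encode match
            literals.clear()
            for p in range(pos + 1, pos + length):         # 318: keep hashing
                if p + min_match <= len(data):             # inside the match
                    chains.setdefault(bytes(data[p:p + min_match]), []).append(p)
            pos += length
    literals.extend(data[pos:])                            # tail shorter than prefix
    if literals:
        records.append((bytes(literals), 0, 0))            # zero-length match
    return records
```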

The flow described in FIG. 3 is merely representative of operations that may occur in particular embodiments. In other embodiments, additional operations may be performed. Various embodiments of the present disclosure contemplate any suitable signaling mechanisms for accomplishing the functions described herein. Some of the operations illustrated in FIG. 3 may be repeated, combined, modified or omitted where appropriate. Additionally, operations may be performed in any suitable order without departing from the scope of particular embodiments.

FIG. 4 illustrates a flow 400 for further compressing a first compressed data set to generate a second compressed data set in accordance with certain embodiments. The operations, along with any other operations described herein relating to compression of a data set, may be performed by any suitable logic, such as compression engine 102B or other suitable logic. Various flow operations are described with respect to FIG. 4 under the assumption that the first compressed data set comprises a plurality of records that each includes an encoding of a literal sequence and a match. However, the present disclosure contemplates any suitable format for storing the literal sequences and matches within the first compressed data set and for processing them when generating the second compressed data set. Various operations of flow 400 may be similar to operations of flow 300 and may be performed in similar manners, adapted of course to the parameters of the compression engine 102B (even if the various characteristics of the operations are not explicitly described below).

At 402, it is determined whether data from the first compressed data set remains to be processed. If further data is to be processed, a next record of the first compressed data set is accessed at 404. At 406, the length of the literal sequence of the record is decoded. As just one example, a header of the record that specifies the length of the literal sequence may be accessed and the length decoded from the header. At 408, a literal length variable (LL_VAR) is set based on the decoded literal sequence length and the minimum match length to be found by the second compression engine (102B). In this example, the minimum match length is assumed to be three bytes. Accordingly, at 408, LL_VAR is set to the decoded literal sequence length minus three.

At 410, it is determined whether LL_VAR is greater than or equal to zero. In this example, if the decoded length of the literal sequence was one or two (and thus the initial LL_VAR value would be less than zero), various operations of the flow may be skipped since the compression engine 102B does not compress literal sequences that are shorter than the minimum length. When LL_VAR is less than zero, the flow moves to 412, where a match length identified in the record is decoded. If the match length is zero (signifying no actual encoded match), the flow returns to the beginning at 402. If the match length is not zero, then the match in the record (which was already encoded by the other compression engine 102A) is passed to the output at 414. Alternatively, if the format of the output of the second compression engine 102B differs from the format of the output of the first compression engine 102A, the encoded match may be modified to comply with the format of the output of the second compression engine 102B at 414. At 416, the hash table 110B of the second compression engine 102B is updated for each location in the match. That is, the data represented by the match is iterated through and for each short prefix (having the minimum length used by compression engine 102B), a hash is performed and a corresponding pointer to the location of the data sequence is associated with the hash value and added to the hash table. In various embodiments, this may involve decoding the match to reproduce the original data. In other embodiments, the original data set may be accessible by the compression engine 102B such that the match does not need to be decoded in order to hash each short prefix in the match (in such embodiments, the corresponding location in the original data set may be tracked throughout the flow 400). After the hash table is updated, the flow returns to 402. Note that a portion of data represented by the match could also be used when a hash is performed to update the hash table (e.g., when the short prefix is longer than the remaining amount of data in a literal sequence, the short prefix may be filled out with data from the next match).

At 410, if LL_VAR is greater than or equal to zero, then the literal sequence is searched for matches having at least the minimum length. At 418, the data of a compare buffer (e.g., compare buffer 106B) is updated. This operation may be similar to operation 302, but in this case the data for the compare buffer 106B is obtained from a literal sequence in a record of the first compressed data set stored, e.g., in intermediate buffer 116A (or from a corresponding location in the input buffer 104 or other buffer that stores the input data set). In various embodiments, the short prefix (that may be stored in the compare buffer 106B) used by compression engine 102B has a smaller size than the short prefix used by compression engine 102A. At 420, the short prefix is hashed to obtain a hash value. The hash value is looked up in a hash table and a determination is made as to whether the hash value is associated with a valid pointer at 422. If it is not, a portion of the data (e.g., the first byte) of the compare buffer (e.g., the portion that will be dropped from the compare buffer and/or placed into the history buffer the next time the compare buffer is updated) is encoded within a literal sequence (or at least set aside for encoding within a literal sequence, since the encoding of a literal sequence may utilize the length of the literal sequence, which is generally not known until the next match is found) at 424.

If a valid pointer is found at 422, then a comparison between the short prefix (and subsequent data from the literal sequence) and the data sequence at the location identified by the pointer is made at 426. If no match is found, the flow moves to 424. If a match is found, the literal bytes of the data sequence (starting with the short prefix) that are found to be matching are replaced with an encoded match (which may be of the minimum length or a greater length) at 428. The match may be represented in a similar manner as the matches of the first compressed data set (i.e., the match has a length attribute and a position attribute). In a particular embodiment, only a single location identified by a pointer is accessed for the comparison (in order to save time). However, in other embodiments, multiple locations identified by multiple pointers associated with the hash value could be accessed for comparison against the short prefix (and subsequent data of the literal sequence).

At 430, the hash table 110B is updated with an association between the hash value and a pointer to the current location. At 432, LL_VAR is updated. If no match was found at 426, LL_VAR may be decremented by one and the compare buffer (and short prefix stored therein) will advance by one byte (or other data amount) within the literal sequence that is being checked for matches. It should be noted that if a match was found at 426, operations 422, 424, 426, and 428 may be skipped so as to advance the compare buffer past the match before additional comparison and encoding operations are performed on the literal sequence. In such a case, operation 430 may still be performed to update the hash table for each position within the match to allow subsequent data to be compared against each of the short prefixes that begin with data from the match.
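The following sketch pulls flow 400 together under stated assumptions: records are (literal bytes, distance, length) tuples from the first pass, the original data set is accessible (so matches need not be decoded, per the note at 416), the short prefix is three bytes, and the hash table keeps a single pointer per hash value:

```python
def second_pass(records, data, min_match=3):
    """Re-compress the literal runs of first-pass records (a sketch of flow 400)."""
    table, out, pos = {}, [], 0

    def index(p):                       # 430/416: overwrite the single pointer
        if p + min_match <= len(data):  # the prefix may spill into the next match
            table[bytes(data[p:p + min_match])] = p

    for literals, distance, length in records:
        lits = bytearray()
        end = pos + len(literals)       # this literal run occupies data[pos:end]
        while pos < end:
            cand = None
            if end - pos >= min_match:                     # 410: LL_VAR >= 0
                cand = table.get(bytes(data[pos:pos + min_match]))  # 420/422
            n = 0
            if cand is not None:                           # 426: one quick compare,
                while pos + n < end and data[cand + n] == data[pos + n]:
                    n += 1                                 # bounded by the run
            if n >= min_match:                             # 428: new encoded match
                out.append((bytes(lits), pos - cand, n))
                lits.clear()
                for p in range(pos, pos + n):
                    index(p)
                pos += n
            else:                                          # 424: keep a literal byte
                index(pos)
                lits.append(data[pos])
                pos += 1
        out.append((bytes(lits), distance, length))        # 412/414: pass through
        for p in range(pos, pos + length):                 # 416: hash inside match
            index(p)
        pos += length
    return out
```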

The flow described in FIG. 4 is merely representative of operations that may occur in particular embodiments. In other embodiments, additional operations may be performed. Various embodiments of the present disclosure contemplate any suitable signaling mechanisms for accomplishing the functions described herein. Some of the operations illustrated in FIG. 4 may be repeated, combined, modified or omitted where appropriate. Additionally, operations may be performed in any suitable order without departing from the scope of particular embodiments.

FIG. 5 illustrates a first example compressed data set 508 in accordance with certain embodiments. The first example hash table 506 and first compressed data set 508 represent a hash table and data set that may be generated by, e.g., a first compression engine 102A that is configured to encode a minimum match length of four bytes and to store the encoded data in records 510. An example input data set 502 is shown, with each character of the data set representing a distinct byte. The hash table 506 (or other hash table used by either compression engine) may include a plurality of pointers that are each associated with a hash value. A hash value may be associated with one or more of the pointers. The association may be made in any suitable manner. In one example, the hash value is an index that identifies a location within a data structure storing the pointers, such that when the data structure is accessed at the index, the associated pointer(s) may be accessed. In another embodiment, the hash value could be stored in the data structure along with the associated pointers or associated with the pointers in any other suitable manner.

Short prefix table 504 depicts the values of the short prefixes at various points during the compression. The first short prefix includes the first four bytes (ABCD) from the input data set. This value is hashed to produce a hash value (ABCD_(H)). At this point, the hash value is not associated with any pointers, thus no comparison is made and the first byte of the short prefix (A) is designated for encoding within a literal sequence in record 510A. The hash value is associated with a pointer (0) to the current location in the hash table 506. The short prefix is updated to the next position and now comprises BCDE. In various embodiments, the data (A) dropped from the short prefix may be added to a history buffer at a location corresponding to the location (0) that was just added to the hash table. A hash function is performed on BCDE and because there are no matches, B is encoded as a literal (within a literal sequence) and BCDE_(H) is associated with the current position (1) in the hash table.

The iterations may continue in this manner until the short prefix again includes the value ABCD. At this point, a hash value (ABCD_(H)) is calculated and the hash table is accessed based on the hash value. The hash table does include a pointer (0) associated with the hash value, therefore data at the location identified by the pointer is compared against data that begins with the data in the current position (9) and a match is found. In fact, the match length is longer than the minimum match length since the data sequence ABCDE which begins at the current position (9) is a repeat of the data sequence ABCDE which begins at the location of the pointer (0). At this stage, the first record 510A may be created. The record includes a literal sequence with a length of nine and a value of ABCDEABCF. The record also includes a match with a length of five and a distance of nine, which represents the location of the beginning of the repeated data sequence (in alternative embodiments the location may be represented in any other suitable manner) with respect to the current position. Hashing operations are continued on the short prefix data as the data advances, but no comparisons are made or literal encoding performed on the data that is already encoded in the most recent match. Accordingly, a new pointer is associated with BCDE_(H), and hash values CDEK_(H), DEKA_(H), and EKAB_(H) are associated with their respective pointers in the hash table. Once the data already encoded in the match has been flushed from the short prefix, the comparison operations and literal encoding operations may resume. For example, when the short prefix includes KABC, the hash table is accessed to determine whether a valid pointer associated with KABC_(H) exists (it does not, and thus KABC_(H) is associated with a pointer to the current position in the hash table). Accordingly, the value K from the short prefix is designated for encoding in a literal sequence. The short prefix is updated to ABCD. ABCD is hashed and the data sequences at locations indicated by the pointers associated with hash value ABCD_(H) are compared with the current short prefix (ABCD) and subsequent data (GHI) to determine whether a match exists (and, if multiple matches exist, to determine the best match). In this case, two matches of equal length (4 bytes) are found starting at location 9 and location 0. In a particular embodiment, when matches of equal length are found, the closest match may be used for the encoding, though other embodiments may use any suitable match for encoding. Accordingly, record 510B includes a literal sequence having a length of one and a value of K and a match having a length of four and a distance of six. Hashes continue on BCDG, CDGH, DGHI and, because the end of the data set is reached, record 510C is created with a literal sequence having a length of three and a value of GHI and a match having a length of zero. In various embodiments, the compressed data set 508 is then provided to a second compression engine 102B. The further compression of the compressed data set 508 is explained in connection with FIG. 6.
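The input data set 502 implied by records 510A-510C is ABCDEABCFABCDEKABCDGHI; as a check of the first-pass sketch given in connection with FIG. 3 (not a claim about the patented engine's exact behavior), running it on that input reproduces the three records:

```python
data = b"ABCDEABCFABCDEKABCDGHI"          # input data set 502
records = first_pass(data, min_match=4)
assert records == [(b"ABCDEABCF", 9, 5),  # record 510A
                   (b"K", 6, 4),          # record 510B
                   (b"GHI", 0, 0)]        # record 510C
```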

FIG. 6 illustrates a second example compressed data set 608 in accordance with certain embodiments. The second example hash table 606 and second compressed data set 608 represent a hash table and data set that may be generated by, e.g., a second compression engine 102B that is configured to encode a minimum match length of three bytes and to store the encoded data in records 610.

Short prefix table 604 depicts the values of the short prefix at various points during the further compression of the first compressed data set. During this second pass, the second compression engine 102B looks for data sequences in the literal sequences of the first compressed data set that match previous data sequences in the data set. The bolded values in the second compressed data set 608 represent a new match and two new literal sequences generated by the second compression engine (while the other matches and literal sequences are the same as corresponding matches and literal sequences of the first compressed data set).

First, the values of the literal sequences of record 510A are passed through the short prefix and matches are sought. In a similar manner to the example of FIG. 5, the hash table 606 is accessed and updated as the values of the literal sequence are iterated through. When the short prefix includes the second instance of ABC, the pointer associated with the hash value ABC_(H) in the hash table 606 is accessed and the data sequence at that location is compared against the data sequence at the current location (beginning with the short prefix) to determine whether a match exists and, if a positive match is found, how long the match is (in various embodiments, the comparison may be bounded by the end of the current literal sequence since subsequent data may already be encoded in a match found by the first compression engine). In this case, the match is of length three. Accordingly, the portion of the literal sequence for which no matches were found (in this case ABCDE) is encoded as a literal sequence having length five in a record 610A along with a match having a length of three and a distance of five (which refers back to the previous instance of ABC). After this match, F is the only value remaining in the literal sequence, so it is encoded as a literal sequence of length one in record 610B along with the match previously found by the first compression engine (corresponding to the repeated instance of ABCDE), which may be passed through to the output. Despite no comparison or encoding operations being performed on the values represented by this match, the values may still pass through the short prefix so that they can be hashed and the hash table 606 can be updated accordingly.

In the hash table 606, the values lined through represent pointers that were once valid and then overwritten (since in this embodiment, compression engine 102B may only compare the data starting with the short prefix for a match against the data starting at a single location indicated by a pointer in the hash table). Accordingly, when the second instance of ABC is found, the pointer associated with ABC_(H) is overwritten to the value of five; when the third instance of ABC is found, the pointer is overwritten to nine; and when the fourth instance is found, the pointer is overwritten to a value of fifteen.

Since the length of the literal sequence in the next record 610C is less than the minimum match length (three bytes), the literal sequence and the accompanying match are passed straight through to the output (but the data slides through the short prefix so that the hash table can be updated) and thus are shown in bold. When the last record is processed, the value GHI is placed in the short prefix and hashed, but since no matches are found, GHI is encoded as a literal sequence and a match length of zero is added to record 610D, thus completing the generation of the second compressed data set 608. In various embodiments, the second compressed data set may be compressed further (e.g., by Huffman or other encoding) before being output (e.g., in an output buffer 122).
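Continuing the check from FIG. 5, applying the second-pass sketch given in connection with FIG. 4 to those records reproduces records 610A-610D:

```python
records2 = second_pass(records, data, min_match=3)
assert records2 == [(b"ABCDE", 5, 3),   # 610A: new match for ABC, distance 5
                    (b"F", 9, 5),       # 610B: pass-through of the ABCDE match
                    (b"K", 6, 4),       # 610C: short literal passed through
                    (b"GHI", 0, 0)]     # 610D: trailing literals, zero match
```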

FIG. 7 illustrates an example block diagram of a field programmable gate array (FPGA) 700 in accordance with certain embodiments. In a particular embodiment, a compression engine 102 may be implemented by an FPGA 700. An FPGA may be a semiconductor device that includes configurable logic. An FPGA may be programmed via a data structure (e.g., a bitstream) having any suitable format that defines how the logic of the FPGA is to be configured. An FPGA may be reprogrammed any number of times after the FPGA is manufactured.

In the depicted embodiment, FPGA 700 includes configurable logic 702, operational logic 704, communication controller 706, and memory controller 710. Configurable logic 702 may be programmed to implement one or more kernels. A kernel may comprise configured logic of the FPGA that may receive a set of one or more inputs, process the set of inputs using the configured logic, and provide a set of one or more outputs. The kernel may perform any suitable type of processing. In various embodiments, a kernel may comprise a compression engine 102. Some FPGAs 700 may be limited to executing a single kernel at a time while other FPGAs may be capable of executing multiple kernels simultaneously. The configurable logic 702 may include any suitable logic, such as any suitable type of logic gates (e.g., AND gates, XOR gates) or combinations of logic gates (e.g., flip flops, look up tables, adders, multipliers, multiplexers, demultiplexers). In some embodiments, the logic is configured (at least in part) through programmable interconnects between logic components of the FPGA.

Operational logic 704 may access a data structure defining a kernel and configure the configurable logic 702 based on the data structure and perform other operations of the FPGA. In some embodiments, operational logic 704 may write control bits to memory (e.g., nonvolatile flash memory or SRAM based memory) of the FPGA 700 based on the data structure, wherein the control bits operate to configure the logic (e.g., by activating or deactivating particular interconnects between portions of the configurable logic). The operational logic 704 may include any suitable logic (which may be implemented in configurable logic or fixed logic), such as one or more memory devices including any suitable type of memory (e.g., random access memory (RAM)), one or more transceivers, clocking circuitry, one or more processors located on the FPGA, one or more controllers, or other suitable logic.

Communication controller 706 may enable FPGA 700 to communicate with other components (e.g., another compression engine 102) of a computer system (e.g., to receive commands to compress data sets). Memory controller 710 may enable the FPGA to read data (e.g., operands or results) from or write data to memory of the computer system 100. In various embodiments, memory controller 710 may comprise a direct memory access (DMA) controller.

In various embodiments, a compression engine 102 may be implemented by one or more processor cores of a processor. Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.

FIG. 8A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the disclosure. FIG. 8B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the disclosure. The solid lined boxes in FIGS. 8A-B illustrate the in-order pipeline and in-order core, while the optional addition of the dashed lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.

In FIG. 8A, a processor pipeline 800 includes a fetch stage 802, a length decode stage 804, a decode stage 806, an allocation stage 808, a renaming stage 810, a scheduling (also known as a dispatch or issue) stage 812, a register read/memory read stage 814, an execute stage 816, a write back/memory write stage 818, an exception handling stage 822, and a commit stage 824.
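Purely as an illustrative aside (not part of the disclosure), the stages of pipeline 800 can be summarized as a simple enumeration; the enumerator names below are hypothetical shorthand for the stages listed above.

    /* Illustrative only: stages of processor pipeline 800, in order. */
    enum pipeline_stage_800 {
        FETCH_802,
        LENGTH_DECODE_804,
        DECODE_806,
        ALLOCATION_808,
        RENAMING_810,
        SCHEDULING_812,                  /* also known as dispatch or issue */
        REGISTER_READ_MEMORY_READ_814,
        EXECUTE_816,
        WRITE_BACK_MEMORY_WRITE_818,
        EXCEPTION_HANDLING_822,
        COMMIT_824
    };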

FIG. 8B shows processor core 890 including a front end unit 830 coupled to an execution engine unit 850, and both are coupled to a memory unit 870. The core 890 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 890 may be a special-purpose core, such as, for example, a network or communication core, compression engine, coprocessor core, general purpose computing graphics processing unit (GPGPU) core, graphics core, or the like.

The front end unit 830 includes a branch prediction unit 832 coupled to an instruction cache unit 834, which is coupled to an instruction translation lookaside buffer (TLB) 836, which is coupled to an instruction fetch unit 838, which is coupled to a decode unit 840. The decode unit 840 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 840 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 890 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 840 or otherwise within the front end unit 830). The decode unit 840 is coupled to a rename/allocator unit 852 in the execution engine unit 850.

The execution engine unit 850 includes the rename/allocator unit 852 coupled to a retirement unit 854 and a set of one or more scheduler unit(s) 856. The scheduler unit(s) 856 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler unit(s) 856 is coupled to the physical register file(s) unit(s) 858. Each of the physical register file(s) units 858 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 858 comprises a vector registers unit, a write mask registers unit, and a scalar registers unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 858 is overlapped by the retirement unit 854 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit 854 and the physical register file(s) unit(s) 858 are coupled to the execution cluster(s) 860. The execution cluster(s) 860 includes a set of one or more execution units 862 and a set of one or more memory access units 864. The execution units 862 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 856, physical register file(s) unit(s) 858, and execution cluster(s) 860 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 864). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

The set of memory access units 864 is coupled to the memory unit 870, which includes a data TLB unit 872 coupled to a data cache unit 874 coupled to a level 2 (L2) cache unit 876. In one exemplary embodiment, the memory access units 864 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 872 in the memory unit 870. The instruction cache unit 834 is further coupled to the level 2 (L2) cache unit 876 in the memory unit 870. The L2 cache unit 876 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 800 as follows: 1) the instruction fetch 838 performs the fetch and length decoding stages 802 and 804; 2) the decode unit 840 performs the decode stage 806; 3) the rename/allocator unit 852 performs the allocation stage 808 and renaming stage 810; 4) the scheduler unit(s) 856 performs the schedule stage 812; 5) the physical register file(s) unit(s) 858 and the memory unit 870 perform the register read/memory read stage 814, and the execution cluster 860 performs the execute stage 816; 6) the memory unit 870 and the physical register file(s) unit(s) 858 perform the write back/memory write stage 818; 7) various units may be involved in the exception handling stage 822; and 8) the retirement unit 854 and the physical register file(s) unit(s) 858 perform the commit stage 824.

The core 890 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 890 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 834/874 and a shared L2 cache unit 876, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

FIGS. 9A-B illustrate a block diagram of a more specific exemplary in-order core architecture, which core would be one of several logic blocks (potentially including other cores of the same type and/or different types) in a chip. The logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed function logic, memory I/O interfaces, and other necessary I/O logic, depending on the application.

FIG. 9A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 902 and with its local subset of the Level 2 (L2) cache 904, according to various embodiments. In one embodiment, an instruction decoder 900 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 906 allows low-latency accesses to cache memory into the scalar and vector units. While in one embodiment (to simplify the design), a scalar unit 908 and a vector unit 910 use separate register sets (respectively, scalar registers 912 and vector registers 914) and data transferred between them is written to memory and then read back in from a level 1 (L1) cache 906, alternative embodiments may use a different approach (e.g., use a single register set or include a communication path that allows data to be transferred between the two register files without being written and read back).

The local subset of the L2 cache 904 is part of a global L2 cache that is divided into separate local subsets (in some embodiments one per processor core). Each processor core has a direct access path to its own local subset of the L2 cache 904. Data read by a processor core is stored in its L2 cache subset 904 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 904 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional to allow agents such as processor cores, L2 caches and other logic blocks to communicate with each other within the chip. In a particular embodiment, each ring data-path is 1012 bits wide per direction.

FIG. 9B is an expanded view of part of the processor core in FIG. 9A according to embodiments. FIG. 9B includes an L1 data cache 906A (part of the L1 cache 906), as well as more detail regarding the vector unit 910 and the vector registers 914. Specifically, the vector unit 910 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 928), which executes one or more of integer, single-precision float, and double-precision float instructions. The VPU supports swizzling the register inputs with swizzle unit 920, numeric conversion with numeric convert units 922A-B, and replication with replication unit 924 on the memory input. Write mask registers 926 allow predicating resulting vector writes.

FIG. 10 is a block diagram of a processor 1000 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to various embodiments. The solid lined boxes in FIG. 10 illustrate a processor 1000 with a single core 1002A, a system agent 1010, and a set of one or more bus controller units 1016; while the optional addition of the dashed lined boxes illustrates an alternative processor 1000 with multiple cores 1002A-N, a set of one or more integrated memory controller unit(s) 1014 in the system agent unit 1010, and special purpose logic 1008.

Thus, different implementations of the processor 1000 may include: 1) a CPU with the special purpose logic 1008 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1002A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 1002A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor with the cores 1002A-N being a large number of general purpose in-order cores. Thus, the processor 1000 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (e.g., including 30 or more cores), embedded processor, or other fixed or configurable logic that performs logical operations. The processor may be implemented on one or more chips. The processor 1000 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

In various embodiments, a processor may include any number of processing elements that may be symmetric or asymmetric. In one embodiment, a processing element refers to hardware or logic to support a software thread. Examples of hardware processing elements include: a thread unit, a thread slot, a thread, a process unit, a context, a context unit, a logical processor, a hardware thread, a core, and/or any other element, which is capable of holding a state for a processor, such as an execution state or architectural state. In other words, a processing element, in one embodiment, refers to any hardware capable of being independently associated with code, such as a software thread, operating system, application, or other code. A physical processor (or processor socket) typically refers to an integrated circuit, which potentially includes any number of other processing elements, such as cores or hardware threads.

A core may refer to logic located on an integrated circuit capable of maintaining an independent architectural state, wherein each independently maintained architectural state is associated with at least some dedicated execution resources. A hardware thread may refer to any logic located on an integrated circuit capable of maintaining an independent architectural state, wherein the independently maintained architectural states share access to execution resources. As can be seen, when certain resources are shared and others are dedicated to an architectural state, the line between the nomenclature of a hardware thread and core overlaps. Yet often, a core and a hardware thread are viewed by an operating system as individual logical processors, where the operating system is able to individually schedule operations on each logical processor.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1006, and external memory (not shown) coupled to the set of integrated memory controller units 1014. The set of shared cache units 1006 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 1012 interconnects the special purpose logic (e.g., integrated graphics logic) 1008, the set of shared cache units 1006, and the system agent unit 1010/integrated memory controller unit(s) 1014, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1006 and cores 1002A-N.

In some embodiments, one or more of the cores 1002A-N are capable of multithreading. The system agent 1010 includes those components coordinating and operating cores 1002A-N. The system agent unit 1010 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 1002A-N and the special purpose logic 1008. The display unit is for driving one or more externally connected displays.

The cores 1002A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1002A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

FIGS. 11-14 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the arts for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, micro controllers, cell phones, portable media players, hand held devices, and various other electronic devices are also suitable for performing the methods described in this disclosure. In general, a huge variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are generally suitable.

FIG. 11 depicts a block diagram of a system 1100 in accordance with one embodiment of the present disclosure. The system 1100 may include one or more processors 1110, 1115, which are coupled to a controller hub 1120. In one embodiment the controller hub 1120 includes a graphics memory controller hub (GMCH) 1190 and an Input/Output Hub (IOH) 1150 (which may be on separate chips or the same chip); the GMCH 1190 includes memory and graphics controllers coupled to memory 1140 and a coprocessor 1145; the IOH 1150 couples input/output (I/O) devices 1160 to the GMCH 1190. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1140 and the coprocessor 1145 are coupled directly to the processor 1110, and the controller hub 1120 is a single chip comprising the IOH 1150.

The optional nature of additional processors 1115 is denoted in FIG. 11 with broken lines. Each processor 1110, 1115 may include one or more of the processing cores described herein and may be some version of the processor 1000.

The memory 1140 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), other suitable memory, or any combination thereof. The memory 1140 may store any suitable data, such as data used by processors 1110, 1115 to provide the functionality of computer system 1100. For example, data associated with programs that are executed or files accessed by processors 1110, 1115 may be stored in memory 1140. In various embodiments, memory 1140 may store data and/or sequences of instructions that are used or executed by processors 1110, 1115.

In at least one embodiment, the controller hub 1120 communicates with the processor(s) 1110, 1115 via a multi-drop bus, such as a frontside bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 1195.

In one embodiment, the coprocessor 1145 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 1120 may include an integrated graphics accelerator.

There can be a variety of differences between the physical resources 1110, 1115 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.

In one embodiment, the processor 1110 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1110 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1145. Accordingly, the processor 1110 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 1145. Coprocessor(s) 1145 accept and execute the received coprocessor instructions.

FIG. 12 depicts a block diagram of a first more specific exemplary system 1200 in accordance with an embodiment of the present disclosure. As shown in FIG. 12, multiprocessor system 1200 is a point-to-point interconnect system, and includes a first processor 1270 and a second processor 1280 coupled via a point-to-point interconnect 1250. Each of processors 1270 and 1280 may be some version of the processor 1000. In one embodiment of the disclosure, processors 1270 and 1280 are respectively processors 1110 and 1115, while coprocessor 1238 is coprocessor 1145. In another embodiment, processors 1270 and 1280 are respectively processor 1110 and coprocessor 1145.

Processors 1270 and 1280 are shown including integrated memory controller (IMC) units 1272 and 1282, respectively. Processor 1270 also includes as part of its bus controller units point-to-point (P-P) interfaces 1276 and 1278; similarly, second processor 1280 includes P-P interfaces 1286 and 1288. Processors 1270, 1280 may exchange information via a point-to-point (P-P) interface 1250 using P-P interface circuits 1278, 1288. As shown in FIG. 12, IMCs 1272 and 1282 couple the processors to respective memories, namely a memory 1232 and a memory 1234, which may be portions of main memory locally attached to the respective processors.

Processors 1270, 1280 may each exchange information with a chipset 1290 via individual P-P interfaces 1252, 1254 using point to point interface circuits 1276, 1294, 1286, 1298. Chipset 1290 may optionally exchange information with the coprocessor 1238 via a high-performance interface 1239. In one embodiment, the coprocessor 1238 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

Chipset 1290 may be coupled to a first bus 1216 via an interface 1296. In one embodiment, first bus 1216 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present disclosure is not so limited.

As shown in FIG. 12, various I/O devices 1214 may be coupled to first bus 1216, along with a bus bridge 1218 which couples first bus 1216 to a second bus 1220. In one embodiment, one or more additional processor(s) 1215, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to first bus 1216. In one embodiment, second bus 1220 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 1220 including, for example, a keyboard and/or mouse 1222, communication devices 1227, and a storage unit 1228 such as a disk drive or other mass storage device which may include instructions/code and data 1230, in one embodiment. Further, an audio I/O 1224 may be coupled to the second bus 1220. Note that other architectures are contemplated by this disclosure. For example, instead of the point-to-point architecture of FIG. 12, a system may implement a multi-drop bus or other such architecture.

FIG. 13 depicts a block diagram of a second more specific exemplary system 1300 in accordance with an embodiment of the present disclosure. Similar elements in FIGS. 12 and 13 bear similar reference numerals, and certain aspects of FIG. 12 have been omitted from FIG. 13 in order to avoid obscuring other aspects of FIG. 13.

FIG. 13 illustrates that the processors 1270, 1280 may include integrated memory and I/O control logic (“CL”) 1272 and 1282, respectively. Thus, the CL 1272, 1282 include integrated memory controller units and include I/O control logic. FIG. 13 illustrates that not only are the memories 1232, 1234 coupled to the CL 1272, 1282, but also that I/O devices 1314 are also coupled to the control logic 1272, 1282. Legacy I/O devices 1315 are coupled to the chipset 1290.

FIG. 14 depicts a block diagram of a SoC 1400 in accordance with an embodiment of the present disclosure. Similar elements in FIG. 10 bear similar reference numerals. Also, dashed lined boxes are optional features on more advanced SoCs. In FIG. 14, an interconnect unit(s) 1402 is coupled to: an application processor 1410 which includes a set of one or more cores 202A-N and shared cache unit(s) 1006; a system agent unit 1010; a bus controller unit(s) 1016; an integrated memory controller unit(s) 1014; a set of one or more coprocessors 1420 which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1430; a direct memory access (DMA) unit 1432; and a display unit 1440 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 1420 include a special-purpose processor, such as, for example, a network or communication processor, compression engine, GPGPU, a high-throughput MIC processor, embedded processor, or the like.

A design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of manners. First, as is useful in simulations, the hardware may be represented using a hardware description language (HDL) or another functional description language. Additionally, a circuit level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. In the case where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be the data specifying the presence or absence of various features on different mask layers for masks used to produce the integrated circuit. In some implementations, such data may be stored in a database file format such as Graphic Data System II (GDS II), Open Artwork System Interchange Standard (OASIS), or similar format.

In some implementations, software based hardware models, and HDL and other functional description language objects, can include register transfer language (RTL) files, among other examples. Such objects can be machine-parsable such that a design tool can accept the HDL object (or model), parse the HDL object for attributes of the described hardware, and determine a physical circuit and/or on-chip layout from the object. The output of the design tool can be used to manufacture the physical device. For instance, a design tool can determine configurations of various hardware and/or firmware elements from the HDL object, such as bus widths, registers (including sizes and types), memory blocks, physical link paths, and fabric topologies, among other attributes that would be implemented in order to realize the system modeled in the HDL object. Design tools can include tools for determining the topology and fabric configurations of systems on chip (SoC) and other hardware devices. In some instances, the HDL object can be used as the basis for developing models and design files that can be used by manufacturing equipment to manufacture the described hardware. Indeed, an HDL object itself can be provided as an input to manufacturing system software to cause the manufacture of the described hardware.

In any representation of the design, the data representing the design may be stored in any form of a machine readable medium. A memory or a magnetic or optical storage, such as a disc, may be the machine readable medium to store information transmitted via optical or electrical waves modulated or otherwise generated to transmit such information. When an electrical carrier wave indicating or carrying the code or design is transmitted, to the extent that copying, buffering, or re-transmission of the electrical signal is performed, a new copy is made. Thus, a communication provider or a network provider may store on a tangible, machine-readable medium, at least temporarily, an article, such as information encoded into a carrier wave, embodying techniques of embodiments of the present disclosure.

Thus, one or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, often referred to as “IP cores,” may be stored on a non-transitory tangible machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that manufacture the logic or processor.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the disclosure may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code, such as code 1230 illustrated in FIG. 12, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In various embodiments, the language may be a compiled or interpreted language.

The embodiments of methods, hardware, software, firmware or code set forth above may be implemented via instructions or code stored on a machine-accessible, machine readable, computer accessible, or computer readable medium which are executable (or otherwise accessible) by a processing element. A non-transitory machine-accessible/readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine, such as a computer or electronic system. For example, a non-transitory machine-accessible medium includes random-access memory (RAM), such as static RAM (SRAM) or dynamic RAM (DRAM); ROM; magnetic or optical storage media; flash memory devices; electrical storage devices; optical storage devices; acoustical storage devices; and other forms of storage devices for holding information received from transitory (propagated) signals (e.g., carrier waves, infrared signals, digital signals); etc., which are to be distinguished from the non-transitory mediums that may receive information therefrom.

Instructions used to program logic to perform embodiments of the disclosure may be stored within a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions can be distributed via a network or by way of other computer readable media. Thus a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, Compact Disc Read-Only Memory (CD-ROMs), magneto-optical disks, Read-Only Memory (ROMs), Random Access Memory (RAM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic or optical cards, flash memory, or a tangible, machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, the computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).

Logic may be used to implement any of the functionality of the various components such as compression engines 102; FPGA 700; core 890; processor 1000; systems 1100, 1200, and 1300; and SoC 1400 (or any of the components of any of these), or other components described herein. “Logic” may refer to hardware, firmware, software and/or combinations of each to perform one or more functions. As an example, logic may include hardware, such as a micro-controller or processor, associated with a non-transitory medium to store code adapted to be executed by the micro-controller or processor. Therefore, reference to logic, in one embodiment, refers to the hardware, which is specifically configured to recognize and/or execute the code to be held on a non-transitory medium. Furthermore, in another embodiment, use of logic refers to the non-transitory medium including the code, which is specifically adapted to be executed by the microcontroller to perform predetermined operations. And as can be inferred, in yet another embodiment, the term logic (in this example) may refer to the combination of the hardware and the non-transitory medium. In various embodiments, logic may include a microprocessor or other processing element operable to execute software instructions, discrete logic such as an application specific integrated circuit (ASIC), a programmed logic device such as a field programmable gate array (FPGA), a memory device containing instructions, combinations of logic devices (e.g., as would be found on a printed circuit board), or other suitable hardware and/or software. Logic may include one or more gates or other circuit components, which may be implemented by, e.g., transistors. In some embodiments, logic may also be fully embodied as software. Software may be embodied as a software package, code, instructions, instruction sets and/or data recorded on a non-transitory computer readable storage medium. Firmware may be embodied as code, instructions or instruction sets and/or data that are hard-coded (e.g., nonvolatile) in memory devices. Often, logic boundaries that are illustrated as separate commonly vary and potentially overlap. For example, first and second logic may share hardware, software, firmware, or a combination thereof, while potentially retaining some independent hardware, software, or firmware.

A memory element may include non-volatile memory and/or volatile memory. Non-volatile memory is a storage medium that does not require power to maintain the state of data stored by the medium. Nonlimiting examples of nonvolatile memory may include any or a combination of: solid state memory (such as planar or 3D NAND flash memory or NOR flash memory), 3D crosspoint memory, memory devices that use chalcogenide phase change material (e.g., chalcogenide glass), byte addressable nonvolatile memory devices, ferroelectric memory, silicon-oxide-nitride-oxide-silicon (SONOS) memory, polymer memory (e.g., ferroelectric polymer memory), ferroelectric transistor random access memory (Fe-TRAM), ovonic memory, nanowire memory, electrically erasable programmable read-only memory (EEPROM), other various types of non-volatile random access memories (RAMs), and magnetic storage memory. In some embodiments, 3D crosspoint memory may comprise a transistor-less stackable cross point architecture in which memory cells sit at the intersection of word lines and bit lines and are individually addressable and in which bit storage is based on a change in bulk resistance. Volatile memory is a storage medium that requires power to maintain the state of data stored by the medium. Examples of volatile memory may include various types of random access memory (RAM), such as dynamic random access memory (DRAM) or static random access memory (SRAM). One particular type of DRAM that may be used in a memory module is synchronous dynamic random access memory (SDRAM).

Use of the phrase ‘to’ or ‘configured to,’ in one embodiment, refers to arranging, putting together, manufacturing, offering to sell, importing and/or designing an apparatus, hardware, logic, or element to perform a designated or determined task. In this example, an apparatus or element thereof that is not operating is still ‘configured to’ perform a designated task if it is designed, coupled, and/or interconnected to perform said designated task. As a purely illustrative example, a logic gate may provide a 0 or a 1 during operation. But a logic gate ‘configured to’ provide an enable signal to a clock does not include every potential logic gate that may provide a 1 or 0. Instead, the logic gate is one coupled in some manner such that during operation the 1 or 0 output is to enable the clock. Note once again that use of the term ‘configured to’ does not require operation, but instead focuses on the latent state of an apparatus, hardware, and/or element, where in the latent state the apparatus, hardware, and/or element is designed to perform a particular task when the apparatus, hardware, and/or element is operating.

Furthermore, use of the phrases ‘capable of/to’ and/or ‘operable to,’ in one embodiment, refers to some apparatus, logic, hardware, and/or element designed in such a way to enable use of the apparatus, logic, hardware, and/or element in a specified manner. Note as above that use of ‘to,’ ‘capable to,’ or ‘operable to,’ in one embodiment, refers to the latent state of an apparatus, logic, hardware, and/or element, where the apparatus, logic, hardware, and/or element is not operating but is designed in such a manner to enable use of an apparatus in a specified manner.

A value, as used herein, includes any known representation of a number, a state, a logical state, or a binary logical state. Often, the use of logic levels, logic values, or logical values is also referred to as 1's and 0's, which simply represents binary logic states. For example, a 1 refers to a high logic level and 0 refers to a low logic level. In one embodiment, a storage cell, such as a transistor or flash cell, may be capable of holding a single logical value or multiple logical values. However, other representations of values in computer systems have been used. For example, the decimal number ten may also be represented as a binary value of 1010 and a hexadecimal letter A. Therefore, a value includes any representation of information capable of being held in a computer system.

Moreover, states may be represented by values or portions of values. As an example, a first value, such as a logical one, may represent a default or initial state, while a second value, such as a logical zero, may represent a non-default state. In addition, the terms reset and set, in one embodiment, refer to a default and an updated value or state, respectively. For example, a default value potentially includes a high logical value, i.e. reset, while an updated value potentially includes a low logical value, i.e. set. Note that any combination of values may be utilized to represent any number of states.

In at least one embodiment, an apparatus comprises a first compression engine to receive a first compressed data block from a second compression engine that is to generate the first compressed data block by compressing a first plurality of repeated instances of data that each have a length greater than or equal to a first length; and compress a second plurality of repeated instances of data of the first compressed data block that each have a length greater than or equal to a second length, the second length being shorter than the first length, wherein each compressed repeated instance of the first and second pluralities of repeated instances comprises a location and length of a data instance that is repeated; and a memory buffer to store the compressed first and second plurality of repeated instances of data.

In an embodiment, the first compressed data block comprises a plurality of records, a record comprising a compressed repeated instance of data and an uncompressed portion of data of an input data set. In an embodiment, the first compression engine is further to identify a repeated instance of data by performing a hash function on a short prefix of data of the first compressed data block to generate a hash value, the short prefix of data having a size equal to the second length; identifying a location associated with the hash value; and determining that the short prefix of data matches data at the location associated with the hash value. In an embodiment, the first compression engine is further to determine a data sequence corresponding to a compressed repeated instance of data of the first compressed data block; apply a hash function to a short prefix of the data sequence to generate a hash value; and store a pointer associated with a location of the data sequence in a memory element to be used to access the data sequence for comparison against a short prefix of data of an uncompressed portion of the first compressed data block. In an embodiment, the data sequence is determined by decoding the compressed repeated instance of data. In an embodiment, the data sequence is obtained from an input data set without decoding the compressed repeated instance of data by accessing the input data set at a location corresponding to the location of the compressed repeated instance of data. In an embodiment, the first length is one byte greater than the second length. In an embodiment, the first length is four bytes and the second length is three bytes. In an embodiment, the first compression engine is a processor comprising at least one processor core. In an embodiment, the second compression engine is a hardware accelerator. In an embodiment, a first hash table generated by the first compression engine is to include a maximum of a single location associated with each hash value of the first hash table; and a second hash table generated by the second compression engine is to include a plurality of locations associated with each of at least some of the hash values of the second hash table. In an embodiment, the first compression engine is further to instruct the second compression engine to generate the first compressed data block. In an embodiment, the apparatus further comprises the second compression engine. In an embodiment, the apparatus further comprises a battery communicatively coupled to the first compression engine, a display communicatively coupled to the first compression engine, or a network interface communicatively coupled to the first compression engine.
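As a minimal interface sketch only (all structure, constant, and function names here are hypothetical and not part of the disclosure), the record layout and the two-length hand-off between engines described above might be modeled as follows; real engines would operate on packed bitstreams rather than arrays of structs, and the engine bodies are omitted.

    #include <stddef.h>
    #include <stdint.h>

    #define FIRST_LENGTH  4    /* minimum match length of the initial pass
                                * (the "first length")                     */
    #define SECOND_LENGTH 3    /* shorter minimum of the second pass
                                * (the "second length")                    */

    /* Hypothetical model of a record of the first compressed data block:
     * an uncompressed (literal) portion of the input data set followed by
     * a compressed repeated instance, i.e., the location and length of
     * the data instance that is repeated. A match_length of zero marks a
     * record carrying only literals. */
    struct record {
        const uint8_t *literals;
        size_t         literal_len;
        uint32_t       match_location;
        uint32_t       match_length;
    };

    /* The second compression engine (e.g., a hardware accelerator)
     * compresses repeated instances of at least FIRST_LENGTH bytes; the
     * first compression engine (e.g., a processor core) rescans the
     * literal portions for repeated instances of at least SECOND_LENGTH
     * bytes that the initial pass skipped. Bodies omitted. */
    size_t accelerator_compress(const uint8_t *in, size_t n,
                                struct record *out, size_t max_records);
    size_t core_recompress(const struct record *in, size_t n_records,
                           struct record *out, size_t max_records);

    /* Hybrid flow: one engine's compressed output is handed to the other
     * engine for further compression; the result may then be entropy
     * coded (e.g., Huffman encoded) before output. */
    size_t hybrid_compress(const uint8_t *in, size_t n,
                           struct record *tmp, struct record *out,
                           size_t max_records)
    {
        size_t n1 = accelerator_compress(in, n, tmp, max_records);
        return core_recompress(tmp, n1, out, max_records);
    }

Under this split, the accelerator's single fast pass removes the long repeated instances, and the core's second pass recovers the short matches (e.g., three-byte matches when the first length is four bytes) that the first pass could not express.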

In at least one embodiment, a method comprises receiving, by a first compression engine, a first compressed data block from a second compression engine that is to generate the first compressed data block by compressing a first plurality of repeated instances of data that each have a length greater than or equal to a first length; and compressing, by the first compression engine, a second plurality of repeated instances of data of the first compressed data block that each have a length greater than or equal to a second length, the second length being shorter than the first length, wherein each compressed repeated instance of the first and second pluralities of repeated instances comprises a location and length of a data instance that is repeated.

In an embodiment, the first compressed data block comprises a plurality of records, a record comprising a compressed repeated instance of data and an uncompressed portion of data of an input data set. In an embodiment, the method further comprises identifying a repeated instance of data by performing a hash function on a short prefix of data of the first compressed data block to generate a hash value, the short prefix of data having a size equal to the second length; identifying a location associated with the hash value; and determining that the short prefix of data matches data at the location associated with the hash value. In an embodiment, the method further comprises determining a data sequence corresponding to a compressed repeated instance of data of the first compressed data block; applying a hash function to a short prefix of the data sequence to generate a hash value; and storing a pointer associated with a location of the data sequence in a hash table to be used to access the data sequence for comparison against a short prefix of data of an uncompressed portion of the first compressed data block. In an embodiment, the data sequence is determined by decoding the compressed repeated instance of data. In an embodiment, the data sequence is obtained from an input data set without decoding the compressed repeated instance of data by accessing the input data set at a location corresponding to the location of the compressed repeated instance of data. In an embodiment, the first length is one byte greater than the second length. In an embodiment, the first length is four bytes and the second length is three bytes. In an embodiment, a first hash table generated by the first compression engine is to include a maximum of a single location associated with each hash value of the first hash table; and a second hash table generated by the second compression engine is to include a plurality of locations associated with each of at least some of the hash values of the second hash table. In an embodiment, the method further comprises instructing, by the first compression engine, the second compression engine to generate the first compressed data block. In an embodiment, a system comprises means to perform any of the methods. In an embodiment, the means comprise machine-readable code that, when executed, causes a machine to perform one or more steps of the method of any of the methods.

In at least one embodiment, a system comprises a first compression engine to generate a first compressed data block by compressing a first plurality of repeated instances of data that each have a length greater than or equal to a first length; and a second compression engine to compress a second plurality of repeated instances of data of the first compressed data block that each have a length greater than or equal to a second length, the second length being shorter than the first length, wherein each compressed repeated instance of the first and second pluralities of repeated instances comprises a location and length of a data instance that is repeated.

In an embodiment, the first compression engine is a hardware accelerator and the second compression engine is a processor comprising at least one processor core. In an embodiment, the system further comprises a battery communicatively coupled to the second compression engine, a display communicatively coupled to the second compression engine, or a network interface communicatively coupled to the second compression engine.

In at least one embodiment, a system comprises means for receiving a first compressed data block from a compression engine that is to generate the first compressed data block by compressing a first plurality of repeated instances of data that each have a length greater than or equal to a first length; and means for compressing a second plurality of repeated instances of data of the first compressed data block that each have a length greater than or equal to a second length, the second length being shorter than the first length, wherein each compressed repeated instance of the first and second pluralities of repeated instances comprises a location and length of a data instance that is repeated.

In an embodiment, the system further comprises means for identifying a repeated instance of data by performing a hash function on a short prefix of data of the first compressed data block to generate a hash value, the short prefix of data having a size equal to the second length; identifying a location associated with the hash value; and determining that the short prefix of data matches data at the location associated with the hash value. In an embodiment, the system further comprises means for determining a data sequence corresponding to a compressed repeated instance of data of the first compressed data block; means for applying a hash function to a short prefix of the data sequence to generate a hash value; and means for storing a pointer associated with a location of the data sequence in a hash table to be used to access the data sequence for comparison against a short prefix of data of an uncompressed portion of the first compressed data block. In an embodiment, the first length is one byte greater than the second length. In an embodiment, the first length is four bytes and the second length is three bytes.

In at least one embodiment, at least one non-transitory machine accessible storage medium has instructions stored thereon, the instructions when executed on a machine, to cause the machine to receive, by a first compression engine, a first compressed data block from a second compression engine that is to generate the first compressed data block by compressing a first plurality of repeated instances of data that each have a length greater than or equal to a first length; and compress, by the first compression engine, a second plurality of repeated instances of data of the first compressed data block that each have a length greater than or equal to a second length, the second length being shorter than the first length, wherein each compressed repeated instance of the first and second pluralities of repeated instances comprises a location and length of a data instance that is repeated. In an embodiment, the first compression engine is a processor comprising at least one processor core and the second compression engine is a hardware accelerator. In an embodiment, the instructions when executed are further to cause the machine to identify a repeated instance of data by performing a hash function on a short prefix of data of the first compressed data block to generate a hash value, the short prefix of data having a size equal to the second length; identifying a location associated with the hash value; and determining that the short prefix of data matches data at the location associated with the hash value. In an embodiment, the first length is one byte greater than the second length.

Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In the foregoing specification, a detailed description has been given with reference to specific exemplary embodiments. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense. Furthermore, the foregoing use of embodiment and other exemplary language does not necessarily refer to the same embodiment or the same example, but may refer to different and distinct embodiments, as well as potentially the same embodiment.

What is claimed is:
1. An apparatus comprising: a memory to store at least a portion of a data block; and a first compression engine comprising circuitry, the first compression engine to: hash a plurality of first portions of the data block to generate a plurality of first hash values, each first portion having a first length; generate a first compressed data block by compressing a first plurality of repeated instances of data of the data block, wherein the repeated instances of data each have a length greater than or equal to the first length, wherein the first plurality of repeated instances are identified based at least in part on the first hash values; and provide the first compressed data block for further compression by a second compression engine, the second compression engine to hash a plurality of second portions of the first compressed data block to generate a plurality of second hash values, each second portion having a second length shorter than the first length, the second compression engine to compress a second plurality of repeated instances of data of the first compressed data block that each have a length greater than or equal to the second length, wherein the second plurality of repeated instances are identified based at least in part on the second hash values.
2. The apparatus of claim 1, wherein the first compression engine comprises a field programmable gate array comprising the circuitry to generate the first compressed data block.
3. The apparatus of claim 1, wherein the first compression engine comprises a microprocessor comprising the circuitry to generate the first compressed data block.
4. The apparatus of claim 1, wherein each compressed repeated instance of the first plurality of repeated instances comprises a location and length of a data instance that is repeated.
5. The apparatus of claim 4, wherein each compressed repeated instance of the second plurality of repeated instances comprises a location and length of a data instance that is repeated.
6. The apparatus of claim 1, wherein the first length is one byte greater than the second length.
 7. The apparatus of claim 1,wherein the first length is four bytes and the second length is threebytes.
 8. The apparatus of claim 1, wherein the first compressed datablock comprises a plurality of records, a record of the plurality ofrecords comprising a compressed repeated instance of data and anuncompressed portion of data of an input data set.
 9. The apparatus ofclaim 1, wherein the first compression engine is to generate the firstcompressed data block in response to an instruction from the secondcompression engine.
 10. The apparatus of claim 1, further comprising thesecond compression engine.
 11. The apparatus of claim 1, furthercomprising a battery communicatively coupled to the first compressionengine, a display communicatively coupled to the first compressionengine, or a network interface communicatively coupled to the firstcompression engine.
 12. A method comprising: hashing, by a first compression engine, a plurality of first portions of a data block to generate a plurality of first hash values, each first portion having a first length; generating, by the first compression engine, a first compressed data block by compressing a first plurality of repeated instances of data of the data block, wherein the repeated instances of data each have a length greater than or equal to the first length, wherein the first plurality of repeated instances are identified based at least in part on the first hash values; and providing, by the first compression engine, the first compressed data block for further compression by a second compression engine, the second compression engine to hash a plurality of second portions of the first compressed data block to generate a plurality of second hash values, each second portion having a second length shorter than the first length, the second compression engine to compress a second plurality of repeated instances of data of the first compressed data block that each have a length greater than or equal to the second length, wherein the second plurality of repeated instances are identified based at least in part on the second hash values.
 13. The method of claim 12, wherein each compressed repeated instance of the first plurality of repeated instances comprises a location and length of a data instance that is repeated.
 14. The method of claim 13, wherein each compressed repeated instance of the second plurality of repeated instances comprises a location and length of a data instance that is repeated.
 15. The method of claim 12, wherein the first compressed data block comprises a plurality of records, a record of the plurality of records comprising a compressed repeated instance of data and an uncompressed portion of data of an input data set.
 16. The method of claim 12, wherein the first length is one byte greater than the second length.
 17. At least one non-transitory machine accessible storage medium having instructions stored thereon, the instructions when executed on a machine, to cause the machine to: hash, by a first compression engine, a plurality of first portions of a data block to generate a plurality of first hash values, each first portion having a first length; generate, by the first compression engine, a first compressed data block by compressing a first plurality of repeated instances of data of the data block, wherein the repeated instances of data each have a length greater than or equal to the first length, wherein the first plurality of repeated instances are identified based at least in part on the first hash values; and provide, by the first compression engine, the first compressed data block for further compression by a second compression engine, the second compression engine to hash a plurality of second portions of the first compressed data block to generate a plurality of second hash values, each second portion having a second length shorter than the first length, the second compression engine to compress a second plurality of repeated instances of data of the first compressed data block that each have a length greater than or equal to the second length, wherein the second plurality of repeated instances are identified based at least in part on the second hash values.
 18. The at least one medium of claim 17, wherein each compressed repeated instance of the first plurality of repeated instances comprises a location and length of a data instance that is repeated.
 19. The at least one medium of claim 18, wherein each compressed repeated instance of the second plurality of repeated instances comprises a location and length of a data instance that is repeated.
 20. The at least one medium of claim 17, wherein the first compressed data block comprises a plurality of records, a record of the plurality of records comprising a compressed repeated instance of data and an uncompressed portion of data of an input data set.